Activity of "R package" repositories, by package source

We study the activity of R repositories by source, that is, by distinguishing the repositories that host a package coming from CRAN, from BioConductor, or from R-Forge, and finally those that are present only on Github.


In [28]:
%matplotlib inline
from IPython.display import set_matplotlib_formats
set_matplotlib_formats('pdf')

In [29]:
import pandas

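# read_csv with index_col=0 and parse_dates=True matches the defaults of the older DataFrame.from_csv
packages = pandas.read_csv('../data/R-Packages.csv', index_col=0, parse_dates=True)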

In [30]:
packages = packages[['bioconductor', 'cran', 'rforge', 'github', 'Github only', 'canonical', 'creation', 'last_push']]
packages = packages.query('github == 1 and canonical == 1').copy()

In [31]:
packages['creation'] = pandas.to_datetime(packages['creation'])
packages['last_push'] = pandas.to_datetime(packages['last_push'])
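
As a minimal sketch of the filtering step above, the toy table below shows what the `query` call keeps. The package names and flag values are made up for illustration and do not come from `R-Packages.csv`; only the column names `github`, `canonical` and `cran` are taken from the notebook.

```python
import pandas

# Hypothetical toy data: one row per repository, 0/1 flags as in the notebook.
toy_packages = pandas.DataFrame(
    {
        'github': [1, 1, 0],
        'canonical': [1, 0, 1],
        'cran': [1, 1, 1],
    },
    index=['pkgA', 'pkgB', 'pkgC'],
)

# Keep only repositories that are on Github and flagged as canonical.
toy_kept = toy_packages.query('github == 1 and canonical == 1').copy()
print(toy_kept)
```

Only `pkgA` satisfies both conditions here; `.copy()` detaches the result from the original frame so that later column assignments do not trigger chained-assignment warnings.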

By creation date


In [32]:
creations = packages.set_index('creation')
creations = creations.sort_index()

In [33]:
_ = creations[['github', 'Github only', 'cran', 'bioconductor', 'rforge']]
y = _.rename(columns={'github': 'Overall', 'Github only': 'Only on Github', 'rforge': 'R-Forge'})
x = y.cumsum()
_ = x.plot(figsize=(15,6), style=['k--'], logy=True, title='Accumulated number of newly created repositories on Github\n')
_.set_xlabel('creation date')
_.set_ylabel('accumulated number of packages')


Out[33]:
[Figure: accumulated number of newly created repositories on Github; x axis: creation date; y axis: accumulated number of packages (log scale)]
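
The cumulative curves rely on the fact that each source column is a 0/1 flag: once the rows are sorted by creation date, a cumulative sum gives the running number of repositories per source. A minimal sketch on hypothetical toy data (the dates and flags below are invented for illustration):

```python
import pandas

# Hypothetical toy data: one row per repository, with 0/1 flags per source.
toy = pandas.DataFrame(
    {
        'creation': pandas.to_datetime(
            ['2012-03-02', '2012-01-05', '2012-03-15', '2012-01-20']
        ),
        'github': [1, 1, 1, 1],
        'cran': [0, 1, 1, 0],
    }
)

# Index by creation date, sort chronologically, then accumulate the flags.
toy_cumulative = toy.set_index('creation').sort_index().cumsum()
print(toy_cumulative)
```

Sorting before `cumsum` matters: without it the running totals would follow file order rather than time.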

In [34]:
# Monthly sums of the per-source flags; on pandas >= 2.2 the month-end alias is spelled 'ME'
t = y.resample('M').sum()['2012-01-01':'2014-12-31']
t = t.plot(figsize=(9,5), ylim=(0,400), style=['k--'], title='Newly created repositories on GitHub, by month\n')
t.set_xlabel('creation date')
t.set_ylabel('number of repositories')
# Raw strings keep the backslashes intact for matplotlib's mathtext
t.legend(('GitHub', r'GitHub \ (CRAN $\cup$ BioConductor $\cup$ R-Forge)', r'GitHub $\cap$ CRAN', r'GitHub $\cap$ BioConductor', r'GitHub $\cap$ R-Forge'), loc='best')


Out[34]:
[Figure: newly created repositories on GitHub, by month, 2012 to 2014; x axis: creation date; y axis: number of repositories]
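
The monthly counts can be illustrated on a small scale. The sketch below uses hypothetical toy data and the month-start alias `'MS'`, chosen here only because it is accepted across pandas versions; it labels each bin by the first day of the month instead of the last.

```python
import pandas

# Hypothetical toy data: 0/1 flags indexed by creation date.
toy = pandas.DataFrame(
    {'github': [1, 1, 1], 'cran': [1, 0, 1]},
    index=pandas.to_datetime(['2012-01-05', '2012-01-20', '2012-03-02']),
)

# Sum the flags per calendar month; months without any repository get 0.
toy_monthly = toy.resample('MS').sum()
print(toy_monthly)
```

Unlike a plain `groupby` on the month, `resample` also emits the empty month (February in this toy data), which keeps the plotted time axis regular.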

By lifetime


In [35]:
last_push = packages.set_index('last_push')
last_push = last_push.sort_index()

In [36]:
_ = last_push[['github', 'Github only', 'cran', 'bioconductor', 'rforge']]
_ = _.rename(columns={'github': 'Overall', 'Github only': 'Only on Github', 'rforge': 'R-Forge'})
_ = _.cumsum()
_ = _.plot(figsize=(15,6), style=['k--'], logy=True, title='Accumulated number of inactive repositories on Github\n')
_.set_xlabel('last PushEvent date')
_.set_ylabel('accumulated number of repositories')


Out[36]:
[Figure: accumulated number of inactive repositories on Github; x axis: last PushEvent date; y axis: accumulated number of repositories (log scale)]

By (penultimate) activity


In [37]:
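def filtered_count(df, category, date):
    # A repository counts as active at `date` if it already existed
    # and its last push is no older than three months before `date`
    cutoff = date - pandas.DateOffset(months=3)
    mask = (df['creation'] <= date) & (df['last_push'] >= cutoff)
    return df.loc[mask, category].sum()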

In [38]:
active = creations.copy()
active['creation'] = active.index

temp = pandas.DataFrame(index=pandas.bdate_range('2008-1-1', periods=84, freq='M'))

for date in temp.index:
    for cat in ['github', 'rforge', 'bioconductor', 'cran', 'Github only']:
        value = filtered_count(active, cat, date)
        temp.at[date, cat+' active'] = value

In [39]:
_ = temp['2008-1-1':'2014-10-01'].plot(figsize=(15,6), style=['k--'],title='Number of active repositories on Github\n')
_.set_xlabel('date')
_.set_ylabel('number of repositories')


Out[39]:
[Figure: number of active repositories on Github; x axis: date; y axis: number of repositories]